Hamming Clustering Guidelines
Hamming clustering is an error correction method that improves debarcoding accuracy by reassigning cells from low-count patterns to nearby high-count patterns. This tutorial explains how it works and provides parameter tuning guidelines.
What is Hamming Distance?
X-Code barcodes can be represented as binary patterns, where each position indicates whether the cell carries (1) or does not carry (0) the n-th elementary DNA sequence. The Hamming distance between two such patterns is the number of positions where they differ. For example:
Pattern A: 1 1 0 1 0 0 1 0 0
Pattern B: 1 0 0 1 0 1 1 0 0
^ ^
Hamming distance = 2 (differ at positions 2 and 6)
In debarcoding, patterns that are close in Hamming distance are likely to be related: one may be a noisy version of the other.
Why Hamming Clustering?
After debarcoding, you typically see:
True barcodes: High cell counts, these are the real barcode populations
Noisy children: Low cell counts, usually within small Hamming distance of a true barcode (the “parent”)
Noisy children arise from measurement errors: a few channels misclassified as ON instead of OFF (or vice versa), depending on the separation between the OFF and ON intensity modes. Hamming clustering identifies these noisy patterns and reassigns their cells to the likely parent barcode.
If two true barcodes are rarely Hamming neighbors (which we can compute from X-Code geometry), then a small pattern near a large pattern is almost certainly noise, not a real barcode.
Theoretical Background: X-Code Geometry
The 4-of-9 Block Structure
X-Code barcodes use a combinatorial scheme where each block of 9 channels has exactly 4 channels ON. This structure is useful for understanding error correction because it constrains which patterns are valid:
Each block has \(\binom{9}{4} = 126\) valid patterns
For \(k=2\) blocks (18 channels): \(L_2 = 126^2 = 15,876\) valid barcodes
For \(k=3\) blocks (27 channels): \(L_3 = 126^3 = 2,000,376\) valid barcodes
Since each block must maintain exactly 4 ON channels, the minimum non-zero Hamming distance between two valid patterns within a block is 2 (flip one ON to OFF and one OFF to ON). This means:
Valid X-Codes always have even Hamming distances from each other
An odd Hamming distance between a valid barcode and another pattern indicates that the latter is invalid
Hamming Neighbors in X-Code Space
For a single 4-of-9 block, the number of valid patterns at each Hamming distance from any pattern:
Distance |
Count |
Formula |
|---|---|---|
0 |
1 |
(self) |
2 |
20 |
\(\binom{4}{1} \times \binom{5}{1}\) |
4 |
60 |
\(\binom{4}{2} \times \binom{5}{2}\) |
6 |
40 |
\(\binom{4}{3} \times \binom{5}{3}\) |
8 |
5 |
\(\binom{4}{4} \times \binom{5}{4}\) |
For multiple blocks, computing neighbors at radius > 2 is more complex because distances can be distributed across blocks. At radius 2, each barcode has exactly \(D_k = 20k\) valid neighbors. Here are the neighbor counts for common configurations:
Radius |
18ch (2 blocks) |
27ch (3 blocks) |
|---|---|---|
2 |
40 |
60 |
4 |
520 |
1,380 |
6 |
2,480 |
15,320 |
8 |
5,210 |
87,615 |
Detailed Formula for Multiple Blocks
Single block: distance distribution
For one block (length-9 binary vector with exactly 4 ones), to get from one valid pattern to another at Hamming distance \(2d\):
Flip \(d\) of the 1s to 0s
Flip \(d\) of the 0s to 1s
Number of ways:
The distance-index polynomial is:
where the exponent \(d\) corresponds to Hamming distance \(2d\).
Multiple blocks: distances
For \(k\) blocks (called n_blocks in the code), if a barcode is at total Hamming distance \(2M\) from the reference:
Let block \(b\) contribute distance \(2d_b\) where \(d_b \in \{0,1,2,3,4\}\)
Total distance constraint: \(\sum_{b=1}^{k} d_b = M\)
For a fixed \((d_1, \ldots, d_k)\) with that sum, the count is:
The total number of barcodes at distance \(2M\) is:
The coefficient of \(x^M\) in \(P(x)^k\) is the number of barcodes at Hamming distance \(2M\) from a given reference barcode in a \(k\)-block code.
The Sublibrary Problem
In practice, experiments use a small sublibrary of \(S\) barcodes from the full library of \(L_k = 126^k\) valid patterns (where \(k\) is the number of blocks). When the sublibrary size is small compared to the full library size, two randomly selected true barcodes are unlikely to be Hamming neighbors.
Expected Number of Neighbor Pairs
Let:
\(k\) = number of 4-of-9 blocks
\(L_k = 126^k\) = total valid barcodes
\(S\) = sublibrary size (number of true barcodes)
\(D_k = 20k\) = number of Hamming-distance-2 neighbors per barcode
Then the expected number of unordered neighbor pairs in a random sublibrary of size \(S\) is:
Sublibrary Size (S) |
18ch (\(L_2\)=15,876) |
27ch (\(L_3\)=2,000,376) |
|---|---|---|
30 |
1.10 |
0.0130 |
100 |
12.47 |
0.1485 |
200 |
50.14 |
0.5969 |
500 |
314.33 |
3.7418 |
1000 |
1258.58 |
14.98 |
2000 |
5036.85 |
59.96 |
3000 |
11334.80 |
134.93 |
5000 |
31489.76 |
374.85 |
Conclusion: Hamming clustering is much safer with 27-channel (2M library) barcodes. For 18-channel barcodes with large sublibraries (S > 1000), true barcode collisions become common.
Safeguards Against Incorrect Remapping
The situation is not as dramatic as the collision counts might suggest. Hamming clustering uses multiple safeguards:
Count ratio test: A pattern is only remapped if the neighbor has significantly more cells (controlled by
ratioparameter)Confidence test: By default, the absorbing neighbor must have higher median confidence than the pattern being absorbed
These safeguards ensure that even when true barcodes happen to be neighbors, the smaller one is only absorbed if it truly looks like noise.
Clustering Methods
Both methods are inspired by starcode, adapted for X-Code geometry.
Message Passing (Default)
The message passing method (method='msg') iteratively merges small patterns into larger neighbors:
Sort patterns by count (ascending), then by median confidence (ascending)
For each pattern p (smallest first):
Find valid neighbors within radius
If p is invalid: merge unconditionally into nearest valid neighbor (no ratio test)
If p is valid: check if neighbor q passes count ratio test:
Count ratio: \(\mathrm{count}(q) \geq \mathrm{ratio} \times \mathrm{count}(p)\)
If test passes, merge p into q (transfer all cells)
Update counts
Repeat until no more merges possible
The method is iterative: small patterns merge into medium-sized ones, which may then merge into larger ones. After all merges, chains are resolved by following each pattern to its final center.
Sphere Clustering
The sphere method (method='sphere') identifies local maxima as cluster centers:
Find all valid patterns that are local maxima (no neighbor has higher count)
Filter centers by minimum count threshold (
min_count_center)For each non-center pattern:
Find closest center within radius
If pattern is invalid: merge unconditionally into nearest valid center
If pattern is valid: merge only if ratio and confidence tests pass
Unlike message passing, sphere clustering assigns patterns directly to pre-identified centers without iterative chain formation. This is useful when message passing creates unwanted merge chains.
Tie Breaking
When multiple neighbors are equidistant and pass all tests, the tie_break parameter controls resolution:
Method |
Description |
When to use |
|---|---|---|
|
Linear Discriminant Analysis on intensity data |
Default; best accuracy |
|
Choose neighbor with highest count |
Fast and more aggressive |
|
Skip remapping for ties |
Conservative; avoid errors |
Whereas starcode splits counts equally among tied neighbors, we use the optimal Gaussian LDA boundary based on the intensity distributions of the neighbors (including the source pattern distribution itself minus the cell event looking to be remapped) on the channels that differ between them. This provides a more informed tie-break decision using the actual measurement values. With LDA, it is possible that a cell event does not get remapped if it does not look likely that it arose from another pattern based on its intensities.
Simulation Results
The following results are from simulations varying sublibrary size, mean channel error rate, and barcode population structure (ground truth concentration of counts within the sublibrary).
Message Passing Performance
The figure below shows message passing performance across different scenarios (all using PC-GMM debarcoding):
Message passing performance summary across simulations. Left panels show 18-channel simulations; right panels show 27-channel simulations. a1, a2) ΔF1 heatmap: F1 improvement over baseline across sublibrary sizes and ratio settings. b1, b2) B/H ratio heatmap: beneficial-to-harmful remap ratio (log scale); values >1 indicate more beneficial than harmful remaps. c1, c2) Change in percentage of barcodes detected. d1, d2) Change in Spearman correlation. e1, e2) Beneficial (solid bars, positive) vs harmful (hatched bars, negative) remap counts.
Performance strongly depends on sublibrary size. The 27-channel (2M library) configuration is much safer across all metrics. For 18-channel barcodes with large S, benefits diminish and risks increase.
Message Passing vs Sphere
Comparison of message passing and sphere methods. Left panels show 18-channel simulations (ratio=20); right panels show 27-channel simulations (ratio=2). a1, a2) ΔF1: F1 improvement over baseline. b1, b2) B/H ratio: beneficial-to-harmful remap ratio (log scale). c1, c2) Beneficial (solid bars, positive) vs harmful (hatched bars, negative) remap counts.
The two methods show similar performance in most cases, with message passing slightly better overall, especially for low barcode counts. The sphere method is useful when you want explicit center control or when message passing creates unwanted merge chains.
Parameter Guidelines
Recommended Settings
18-channel (2 blocks):
Default:
ratio=20(safe)S ≤ 100: can lower to
ratio=5-10for more aggressive correctionS ≥ 3000: marginal benefit, consider skipping Hamming clustering
27-channel (3 blocks):
Default:
ratio=5(safe)S ≤ 3000: can lower to
ratio=2for maximum correctionS > 3000: keep
ratio=5
Quick Reference
Library |
Sublibrary Size (S) |
ratio |
radius |
Notes |
|---|---|---|---|---|
18ch (2 blocks) |
S ≤ 100 |
5-10 |
2 |
Aggressive correction safe |
18ch (2 blocks) |
100 < S < 3000 |
15-20 |
2 |
Default settings |
18ch (2 blocks) |
S ≥ 3000 |
- |
- |
Consider skipping |
27ch (3 blocks) |
S ≤ 3000 |
2-5 |
2 |
Aggressive correction safe |
27ch (3 blocks) |
S > 3000 |
5 |
2 |
Keep conservative |
Other Parameters
Parameter |
Default |
Notes |
|---|---|---|
|
‘lda’ |
Use ‘no_remap’ for maximum safety |
|
1 |
For sphere method, set based on expected population size |
Example Workflow
import warnings
warnings.filterwarnings('ignore')
import xcode_debarcode as xd
import plotly.io as pio
pio.renderers.default = 'notebook'
DATA_DIR = '../../data'
# Load 27-channel dataset
adata = xd.io.read_data(f'{DATA_DIR}/2M_test.fcs')
adata = xd.io.map_channels(adata, f'{DATA_DIR}/barcode_channel_mapping_27ch.csv')
print(f"Loaded {adata.n_obs:,} cells, {adata.n_vars} channels")
Channel mapping complete
Barcode channels: 27 (stored in adata.uns['barcode_channels'])
Other channels: 22
Total channels: 49
Loaded 34,071 cells, 49 channels
# Standard pipeline
adata = xd.preprocessing.transform(adata, method='log', verbose=False)
adata = xd.debarcode.debarcode(adata, method='pc_gmm', layer='log', verbose=False)
Apply Hamming Clustering
Since this is the 2M library (3 blocks, 27 channels) and we expect only 4 true barcodes, we can safely use an aggressive ratio=2 setting.
adata = xd.postprocessing.hamming_cluster(
adata,
assignment_col='pc_gmm_assignment',
confidence_col='pc_gmm_confidence',
method='msg',
radius=2,
ratio=2.0,
verbose=True
)
Hamming clustering: method=msg, radius=2, ratio=2.0, ratio_metric=count, tie_break=lda
Eligible: all cells (34071)
Remapped: 10043/34071 cells (29.5%)
Ties encountered: 2968, LDA applied: 2112
[+] Clustering saved in adata.uns['hamming_clustering']['pc_gmm_assignment']
[+] Results stored in:
- adata.obs['pc_gmm_hamming_assignment']
- adata.obs['pc_gmm_hamming_remapped']
After Hamming clustering:
adata.obs['pc_gmm_hamming_assignment']: corrected assignmentsadata.obs['pc_gmm_hamming_remapped']: boolean mask of remapped cells
Note: Confidence values are not modified by Hamming clustering. Continue to use pc_gmm_confidence for filtering.
Visualizations
Barcode Rank Histogram
Before Hamming:
fig = xd.plots.plot_barcode_rank_histogram(adata, assignment_col='pc_gmm_assignment')
fig
After Hamming:
fig = xd.plots.plot_barcode_rank_histogram(adata, assignment_col='pc_gmm_hamming_assignment')
fig
Hamming Graph
The Hamming graph shows patterns as nodes connected by edges when they are within a given Hamming distance. Node size represents cell count, and color can represent confidence or validity.
Before Hamming (shows noisy children connected to true barcodes):
fig = xd.plots.plot_hamming_graph(
adata,
assignment_col='pc_gmm_assignment',
confidence_col='pc_gmm_confidence',
radius=2,
min_count=50
)
fig
Building Hamming graph with 33 patterns...
Nodes: 33, Edges: 57
The visualization confirms that there are 4 real barcodes, each surrounded by a halo of noisy children within Hamming distance 2.
After Hamming:
fig = xd.plots.plot_hamming_graph(
adata,
assignment_col='pc_gmm_hamming_assignment',
confidence_col='pc_gmm_confidence',
radius=2,
min_count=50
)
fig
Building Hamming graph with 25 patterns...
Nodes: 25, Edges: 23
Hamming clustering has considerably reduced the number of noisy children by reassigning them to their parent barcodes.
Hamming Heatmap
The Hamming heatmap shows pairwise distances between patterns. Patterns within the specified radius appear as colored cells.
Before Hamming:
fig = xd.plots.plot_hamming_heatmap(
adata,
assignment_col='pc_gmm_assignment',
hamming_radius=4,
min_count=50
)
fig
Computing Hamming distance matrix for 33 patterns...
Patterns clustered by hierarchical clustering
After Hamming:
fig = xd.plots.plot_hamming_heatmap(
adata,
assignment_col='pc_gmm_hamming_assignment',
hamming_radius=4,
min_count=50
)
fig
Computing Hamming distance matrix for 25 patterns...
Patterns clustered by hierarchical clustering
Summary
Hamming clustering corrects noise by reassigning cells from low-count patterns to nearby high-count parents
Safety depends on sublibrary size: 27-channel is much safer than 18-channel for large S
Multiple safeguards (ratio, confidence) protect against incorrect remapping
Start with recommended defaults, adjust ratio based on your configuration
Always compare before/after using barcode rank histograms
Confidence values are not modified by Hamming clustering